An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

Author: Fuegen Christian
Le Duc
Mahadeokar Jay
Moritz Niko
Seide Frank
Publication date: 19/04/2022

The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are the RNN-Transducer (RNN-T) and the connectionist temporal classification (CTC) objectives. Both perform alignment-free training by marginalizing over all possible alignments, but use different transition rules. Between these two loss types we can classify the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T), both of which can be realized using the graph temporal classification-transducer (GTC-T) loss function. Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination, where a model keeps emitting non-blank symbols without advancing in time, often in an infinite loop. Second, monotonic transducers consume exactly one model score per time step and are therefore more compatible and unifiable with traditional FST-based hybrid ASR decoders. However, the MonoRNN-T has so far been found to have worse accuracy than RNN-T. It does not have to be that way, though: by regularizing the training (via joint LAS training or parameter initialization from RNN-T), both MonoRNN-T and CTC-T perform as well as, or better than, RNN-T. This is demonstrated for LibriSpeech and for a large-scale in-house data set. Comment: Submitted to Interspeech 202

arXiv.org e-Print Archive

Factorized Blank Thresholding for Improved Runtime Efficiency of Neural Transducers

Author: Kalinli Ozlem
Le Duc
Li Yang
Schubert Kjell
Seide Frank
Seltzer Michael L.
Wang Yuhao
Publication date: 02/11/2022

We show how factoring the RNN-T's output distribution can significantly reduce the computation cost and power consumption for on-device ASR inference with no loss in accuracy. With the rise in popularity of neural-transducer type models like the RNN-T for on-device ASR, optimizing RNN-T's runtime efficiency is of great interest. While previous work has primarily focused on the optimization of RNN-T's acoustic encoder and predictor, this paper focuses on the joiner. We show that despite being only a small part of RNN-T, the joiner has a large impact on the overall model's runtime efficiency. We propose to factorize the joiner into blank and non-blank portions so that the more expensive non-blank computation can be skipped when the blank probability exceeds a certain threshold. Since the blank probability can be computed very efficiently and the RNN-T output is dominated by blanks, our proposed method leads to a 26-30% decoding speed-up and a 43-53% reduction in on-device power consumption, while incurring no accuracy degradation and being relatively simple to implement. Comment: Submitted to ICASSP 202

arXiv.org e-Print Archive
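
The thresholding idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `factorized_joiner_step`, the sigmoid blank head, the softmax over non-blank tokens, and the threshold value 0.9 are all assumptions made for illustration. The point is only that the expensive non-blank computation is invoked lazily, and skipped when the cheap blank probability is already above the threshold.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def factorized_joiner_step(blank_logit, nonblank_fn, threshold=0.9):
    """Return (p_blank, nonblank_probs or None).

    blank_logit: scalar logit from a cheap blank head (assumed).
    nonblank_fn: zero-argument callable producing the expensive
                 non-blank logits; only called when needed.
    threshold:   hypothetical skip threshold, not a value from the paper.
    """
    p_blank = sigmoid(blank_logit)
    if p_blank > threshold:
        # Blank dominates: skip the non-blank computation entirely.
        return p_blank, None
    conditional = softmax(nonblank_fn())
    # Scale so that blank and non-blank probabilities sum to one.
    return p_blank, [(1.0 - p_blank) * p for p in conditional]
```

In a decoder loop, a return value of `None` for the non-blank part would mean "emit blank and advance to the next frame" without ever evaluating the non-blank logits.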

Directional Source Separation for Robust Speech Recognition on Smart Glasses

Author: Feng Tiantian
He Weipeng
Huang Yiteng
Kalgaonkar Kaustubh
Lei Xin
Lin Ju
Moritz Niko
Seide Frank
Sun Ming
Wan Li
Publication date: 19/09/2023

Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noise, which degrades speech recognition and speaker change detection. To improve voice quality, this work investigates directional source separation using the multi-microphone array. We first explore multiple beamformers to assist source separation modeling by strengthening the directional properties of speech signals. In addition to relying on predetermined beamformers, we investigate neural beamforming in multi-channel source separation, demonstrating that automatically learning directional characteristics effectively improves separation quality. We further compare the ASR performance obtained from separated outputs with that obtained from noisy inputs. Our results show that directional source separation benefits ASR for the wearer but not for the conversation partner. Lastly, we perform joint training of the directional source separation and ASR model, achieving the best overall ASR performance. Comment: Submitted to ICASSP 202

arXiv.org e-Print Archive

Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

Author: Alistarh Dan
Chilimbi Trishul
Devlin Jacob
Ho Qirong
Hoefler T.
Hsieh Kevin
Message Passing Interface Forum
Jayarajan Anand
Lian Xiangru
Recht B.
Seide Frank
Strom Nikko
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2020

Load imbalance is pervasive in distributed deep learning training systems, caused either by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy. Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 202

arXiv.org e-Print Archive

Crossref

IST Austria: PubRep (Institute of Science and Technology)

SparCML: High-Performance Sparse Communication for Machine Learning

Author: Abadi Martín
Alistarh Dan
Chen Tianqi
Chilimbi Trishul M
Devlin Jacob
Hoefler T.
Message Passing Interface Forum
Ketkar Nikhil
Sa Christopher De
Seide Frank
Strom Nikko
Wen Wei
Publication date: 01/01/2019

Applying machine learning techniques to the quickly growing data in science and industry requires highly scalable algorithms. Large datasets are most commonly processed "data parallel", distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication bottleneck, and thus the scalability bottleneck, for most machine learning workloads. We observe that frequently, many gradient values are (close to) zero, leading to sparse or sparsifiable communications. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations by allowing processes to contribute arbitrary sparse input data vectors. Our generic communication library, SparCML, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations. As such, SparCML and its techniques will form the basis of future highly scalable machine learning frameworks.

arXiv.org e-Print Archive

Crossref

IST Austria: PubRep (Institute of Science and Technology)
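
The observation in the SparCML abstract above, that near-zero gradient entries make an allreduce sparsifiable, can be sketched in a few lines. This is a single-process simulation for illustration only: it does not use MPI, it is not SparCML's API, and the names `sparsify`, `sparse_allreduce_sum`, and the cutoff `eps` are hypothetical. Each node's gradient is reduced to an index-to-value map, and the sum only touches indices that at least one node actually sent.

```python
from collections import defaultdict


def sparsify(dense, eps=1e-6):
    """Keep only entries whose magnitude exceeds eps (assumed cutoff)."""
    return {i: v for i, v in enumerate(dense) if abs(v) > eps}


def sparse_allreduce_sum(contributions):
    """Sum sparse vectors given as {index: value} dicts.

    Simulates the result every node would hold after a sum-allreduce;
    a real implementation would exchange these maps over the network.
    """
    total = defaultdict(float)
    for vec in contributions:
        for idx, val in vec.items():
            total[idx] += val
    return dict(total)
```

For example, `sparse_allreduce_sum([sparsify(g) for g in per_node_gradients])` would communicate only the surviving entries of each node's gradient, where `per_node_gradients` is a hypothetical list of dense gradient lists.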

Marian: Fast Neural Machine Translation in C++

Author: Aji Alham Fikri
Birch Alexandra
Bogoychev Nikolay
Dwojak Tomasz
Germann Ulrich
Grundkiewicz Roman
Heafield Kenneth
Hoang Hieu
Junczys-Dowmunt Marcin
Martins André F.T.
Neckermann Tom
Seide Frank
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 04/04/2018

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed. Comment: Demonstration paper

arXiv.org e-Print Archive

Edinburgh Research Explorer

Towards an Automated Directory Information System

Author: Andreas Kellner
Frank Seide

This paper describes a design and feasibility study for a large-scale automatic directory information system with a scalable architecture. The current demonstrator, called PADIS-XL, operates in real time and handles a database of a medium-size German city with 130,000 listings. The system uses a new technique of taking a combined decision on the joint probability over multiple dialogue turns, and a dialogue strategy that strives to restrict the search space more and more with every dialogue turn. During the course of the dialogue, the last name of the desired subscriber must be spelled out. The spelling recognizer permits continuous spelling and uses a context-free grammar to parse common spelling expressions. This paper describes the system architecture, our maximum a-posteriori (MAP) decision rule, the spelling grammar, and the dialogue strategy. We give results on the SPEECHDAT and SIETILL databases on recognition of first names by spelling and on jointly deciding on the spelled and the spoken name. In a 35,000-names setup, the joint decision reduced name-recognition errors by 31%.

CiteSeerX